302 research outputs found

    Efficient known ncRNA search including pseudoknots

    BACKGROUND: Searching for members of characterized ncRNA families containing pseudoknots is an important component of genome-scale ncRNA annotation. However, the state-of-the-art known ncRNA search is based on context-free grammar (CFG), which cannot effectively model pseudoknots. Thus, existing CFG-based ncRNA identification tools usually ignore pseudoknots during search. As a result, dozens of sequences that do not contain the native pseudoknots are reported by these tools. When pseudoknot structures are vital to the functions of the ncRNAs, these sequences may not be true members. RESULTS: In this work, we design a pseudoknot search tool using multiple simple sub-structures, which are derived from knot-free and bifurcation-free structural motifs in the underlying family. We test our tool on a contiguous 22-Mb region of the maize genome. The experimental results show that our work competes favorably with other pseudoknot search methods. CONCLUSIONS: Our sub-structure-based tool can conduct genome-scale pseudoknot-containing ncRNA search effectively and efficiently. It provides a pseudoknot search tool complementary to Infernal. The source code is available at http://www.cse.msu.edu/~chengy/knotsearch

    Predicting the hosts of prokaryotic viruses using GCN-based semi-supervised learning

    Background: Prokaryotic viruses, which infect bacteria and archaea, are the most abundant and diverse biological entities in the biosphere. To understand their regulatory roles in various ecosystems and to harness the potential of bacteriophages for use in therapy, more knowledge of virus-host relationships is required. High-throughput sequencing and its application to the microbiome have offered new opportunities for computational approaches to predicting which hosts particular viruses can infect. However, there are two main challenges for computational host prediction. First, the empirically known virus-host relationships are very limited. Second, although sequence similarity between viruses and their prokaryotic hosts has been used as a major feature for host prediction, the alignment is either missing or ambiguous in many cases. Thus, there is still a need to improve the accuracy of host prediction. Results: In this work, we present a semi-supervised learning model, named HostG, to conduct host prediction for novel viruses. We construct a knowledge graph by utilizing both virus-virus protein similarity and virus-host DNA sequence similarity. Then a graph convolutional network (GCN) is adopted to exploit viruses with or without known hosts in training to enhance the learning ability. During the GCN training, we minimize the expected calibration error (ECE) to ensure the confidence of the predictions. We tested HostG on both simulated and real sequencing data and compared its performance with other state-of-the-art methods specifically designed for virus host classification (VHM-net, WIsH, PHP, HoPhage, RaFAH, vHULK, and VPF-Class). Conclusion: HostG outperforms other popular methods, demonstrating the efficacy of using a GCN-based semi-supervised learning approach. A particular advantage of HostG is its ability to predict hosts from new taxa. Comment: 16 pages, 14 figures
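
The abstract above mentions the expected calibration error (ECE) without defining it. As a rough illustration only, the sketch below computes the standard binned ECE for a set of predictions: predictions are grouped by confidence, and the gaps between each bin's accuracy and mean confidence are averaged, weighted by bin size. The function name, the equal-width binning, and the default of 10 bins are assumptions made for this example; this is not HostG's training code, which minimizes the quantity during GCN training rather than merely measuring it.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: size-weighted mean of |accuracy - mean confidence| per bin.

    confidences: predicted probabilities of the chosen class, each in [0, 1].
    correct: booleans, True where the prediction was right.
    n_bins: number of equal-width confidence bins (an assumed default).
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Four predictions made at 90% confidence, three of them right.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                     [True, True, True, False]))
```

A value near zero means the stated confidences track the observed accuracy; larger values indicate over- or under-confidence.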

    Choosing the best heuristic for seeded alignment of DNA sequences

    BACKGROUND: Seeded alignment is an important component of algorithms for fast, large-scale DNA similarity search. A good seed matching heuristic can reduce the execution time of genomic-scale sequence comparison without degrading sensitivity. Recently, many types of seed have been proposed to improve on the performance of traditional contiguous seeds as used in, e.g., NCBI BLASTN. Choosing among these seed types, particularly those that use information besides the presence or absence of matching residue pairs, requires practical guidance based on a rigorous comparison, including assessment of sensitivity, specificity, and computational efficiency. This work performs such a comparison, focusing on alignments in DNA outside widely studied coding regions. RESULTS: We compare seeds of several types, including those allowing transition mutations rather than matches at fixed positions, those allowing transitions at arbitrary positions ("BLASTZ" seeds), and those using a more general scoring matrix. For each seed type, we use an extended version of our Mandala seed design software to choose seeds with optimized sensitivity for various levels of specificity. Our results show that, on a test set biased toward alignments of noncoding DNA, transition information significantly improves seed performance, while finer distinctions between different types of mismatches do not. BLASTZ seeds perform especially well. These results depend on properties of our test set that are not shared by EST-based test sets with a strong bias toward coding DNA. CONCLUSION: Practical seed design requires careful attention to the properties of the alignments being sought. For noncoding DNA sequences, seeds that use transition information, especially BLASTZ-style seeds, are particularly useful. The Mandala seed design software can be found at
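
To make the seed types discussed in this abstract concrete, here is a minimal sketch of how a seed with fixed transition-tolerant positions could be tested against two sequences compared at the same offset (ungapped). The pattern notation is an assumption chosen for this example and is not taken from Mandala: `1` requires a match, `@` accepts a match or a transition (A<->G, C<->T), and `0` is a don't-care position. BLASTZ-style seeds, which allow a transition at an arbitrary position, are not modelled here.

```python
# Transition substitutions: purine <-> purine, pyrimidine <-> pyrimidine.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def position_ok(symbol, x, y):
    """Check one seed position against one aligned pair of bases."""
    if symbol == "0":          # don't-care position
        return True
    if x == y:                 # a match satisfies both '1' and '@'
        return True
    return symbol == "@" and (x, y) in TRANSITIONS


def seed_hits(seed, s, t):
    """Return every offset i at which `seed` accepts s[i:i+w] against t[i:i+w]."""
    w = len(seed)
    n = min(len(s), len(t))
    return [
        i
        for i in range(n - w + 1)
        if all(position_ok(c, s[i + k], t[i + k]) for k, c in enumerate(seed))
    ]


if __name__ == "__main__":
    s = "ACGTACGT"
    t = "ACATACGT"  # differs from s by one G->A transition at index 2
    for seed in ("111", "1@1", "101"):
        print(seed, seed_hits(seed, s, t))
```

In this toy case the contiguous seed `111` misses every window that covers the substituted base, while `1@1` and `101` both recover the window starting at offset 1. The two differ in specificity: `1@1` would reject a transversion at that position, whereas `101` accepts any base there.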

    Designing seeds for similarity search in genomic DNA

    Large-scale comparison of genomic DNA is of fundamental importance in annotating functional elements of genomes. To perform large comparisons efficiently, BLAST (Methods: Companion Methods Enzymol 266 (1996) 460, J. Mol. Biol. 215 (1990) 403, Nucleic Acids Res. 25(17) (1997) 3389) and other widely used tools use seeded alignment, which compares only sequences that can be shown to share a common pattern or “seed” of matching bases. The literature suggests that the choice of seed substantially affects the sensitivity of seeded alignment, but designing and evaluating seeds is computationally challenging. This work addresses the problem of designing a seed to optimize performance of seeded alignment. We give a fast, simple algorithm based on finite automata for evaluating the sensitivity of a seed in a Markov model of ungapped alignments, along with extensions to mixtures and inhomogeneous Markov models. We give intuition and theoretical results on which seeds are good choices. Finally, we describe Mandala, a software tool for seed design, and show that it can be used to improve the sensitivity of alignment in practice.
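
The idea of seed sensitivity can be illustrated with a small dynamic program. The sketch below is not the paper's algorithm: it assumes the simplest possible model, in which every column of an ungapped alignment is independently a match with probability `p` (a zero-order special case of a Markov model), and it uses a naive automaton whose state is just the last `w - 1` alignment columns. The function name and the `1`/`0` seed notation (required match / don't care) are assumptions made for this example.

```python
def seed_sensitivity(seed, length, p):
    """Probability that `seed` hits at least once in a random ungapped alignment.

    seed:   string over '1' (required match) and '0' (don't care).
    length: number of alignment columns.
    p:      probability that a column is a match (columns assumed i.i.d.).
    """
    w = len(seed)
    seed_mask = int(seed, 2)      # leftmost seed position = oldest column = highest bit
    keep = (1 << (w - 1)) - 1     # mask that retains only the last w-1 columns

    # no_hit[state] = probability of reaching `state` without any seed hit so far.
    no_hit = {0: 1.0}
    for t in range(1, length + 1):
        nxt = {}
        for state, mass in no_hit.items():
            for bit, prob in ((1, p), (0, 1.0 - p)):
                window = (state << 1) | bit   # the last w columns, newest in bit 0
                if t >= w and (window & seed_mask) == seed_mask:
                    continue                  # seed hit: remove this mass
                key = window & keep
                nxt[key] = nxt.get(key, 0.0) + mass * prob
        no_hit = nxt
    return 1.0 - sum(no_hit.values())


if __name__ == "__main__":
    # Compare a contiguous and a spaced seed of equal weight on short alignments.
    for seed in ("111", "1101"):
        print(seed, seed_sensitivity(seed, 20, 0.7))
```

The state space grows as 2^(w-1), so this naive version is only practical for short seeds; avoiding that kind of blow-up is part of what makes seed evaluation computationally challenging.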

    Comparison of characteristics and mortality in multidrug resistant (MDR) and non-MDR tuberculosis patients in China

    BACKGROUND: We conducted a cohort study in China to compare the characteristics of MDR-TB patients with those of non-MDR-TB patients, to measure the long-term (9-year) mortality rate, and to determine factors associated with death. METHODS: We reviewed the medical records of 250 TB cases from a 2001 survey to compare 100 MDR-TB patients with 150 non-MDR-TB patients who were treated in 2001-2002. Baseline attributes extracted from the records were compared between the two cohorts, and long-term mortality and risk factors were determined at nine-year follow-up in 2010. RESULTS: Among the 234 patients successfully followed up, 63 (26.9%) were female and 171 (73.1%) were male. MDR-TB patients had poorer socioeconomic status than non-MDR-TB patients. Nine years after the diagnosis of TB, 69 (29.5%) of the 234 patients had died (32 or 21.6% of non-MDR-TB versus 37 or 43.0% of MDR-TB) and the overall mortality rate was 39/1000 per year (PY) (27/1000 PY among non-MDR-TB versus 63/1000 PY among MDR-TB). Factors associated with death included MDR status (hazard ratio (HR): 1.86; CI: 1.09-3.13), limited education of primary school or lower (HR: 2.51; CI: 1.34-4.70), and receiving TB treatment during the nine-year period (HR: 1.82; 95% CI: 1.02-3.26). CONCLUSIONS: MDR-TB was a strong predictor of poor long-term outcome. High-quality diagnosis and treatment must be ensured. Greater reimbursement or free treatment may be needed to provide access for poor and vulnerable populations, and to increase treatment compliance. Funding for the study was provided by the National Centre for Epidemiology and Population Health, Australian National University, as the PhD study project for Yanni Sun, a PhD candidate at the Centre.

    PhaBOX: A web server for identifying and characterizing phage contigs in metagenomic data

    Motivation: There is accumulating evidence showing the important roles of bacteriophages (phages) in regulating the structure and functions of the microbiome. However, the lack of easy-to-use, integrated phage analysis software hampers microbiome-related research from incorporating phages in the analysis. Results: In this work, we developed a web server, PhaBOX, to comprehensively identify and analyze phage contigs in metagenomic data. To the best of our knowledge, this is the first web server that supports integrated phage analysis, including detecting phage contigs from the metagenomic assembly, lifestyle prediction, taxonomic classification, and host prediction. Instead of treating the algorithms as a black box, PhaBOX also supports visualization of the essential features for making predictions. With the user-friendly graphical interface, users with or without informatics training can easily use the web server for analyzing phages in microbiome data. Availability: The web server of PhaBOX is available via: https://phage.ee.cityu.edu.hk. The source code of PhaBOX is available via: https://github.com/KennthShang/PhaBOX Comment: 5 pages, 1 figure

    Distinct composition and amplification dynamics of transposable elements in sacred lotus (Nelumbo nucifera Gaertn.)

    Sacred lotus (Nelumbo nucifera Gaertn.) is a basal eudicot plant with a unique lifestyle, physiological features, and evolutionary characteristics. Here we report the unique profile of transposable elements (TEs) in the genome, using a manually curated repeat library. TEs account for 59% of the genome, and hAT (Ac/Ds) elements alone represent 8%, more than in any other known plant genome. About 18% of the lotus genome is comprised of Copia LTR retrotransposons, and over 25% of them are associated with non-canonical termini (non-TGCA). Such high abundance of non-canonical LTR retrotransposons has not been reported for any other organism. TEs are very abundant in genic regions, with retrotransposons enriched in introns and DNA transposons primarily in flanking regions of genes. The recent insertion of TEs in introns has led to significant intron size expansion, with a total of 200 Mb in the 28 455 genes. This is accompanied by declining TE activity in intergenic regions, suggesting distinct control efficacy of TE amplification in different genomic compartments. Despite the prevalence of TEs in genic regions, some genes are associated with fewer TEs, such as those involved in fruit ripening and stress responses. Other genes are enriched with TEs, and genes in epigenetic pathways are the most associated with TEs in introns, indicating a dynamic interaction between TEs and the host surveillance machinery. The dramatic differential abundance of TEs with genes involved in different biological processes as well as the variation of target preference of different TEs suggests the composition and activity of TEs influence the path of evolution

    Research on One Novel Logging Interpretation Method of CBM Reservoir

    Coalbed methane (CBM) is a kind of natural gas which is stored in the micropores and fractures of the “coal seam” and has not been transported out of the source rock. Conventional logging technology plays an important role in coalbed methane exploration and development. By analyzing the response characteristics of conventional logging of coalbed methane, coal-bearing strata are accurately determined. Two methods, a statistical model and a volume model, are established to analyze and calculate industrial components. Based on the study of the adsorption isotherm and the correlation between logging parameters and coal core gas content, the calculation method of coal seam gas content is determined. In practice, the calculation accuracy of industrial components and gas content of the coal seam has been significantly improved.

    2-(2-Iodophenyl)-1,2,3,4-tetrahydroisoquinoline-1-carbonitrile

    In the title compound, C16H13IN2, the two benzene rings make a dihedral angle of 67.26 (5)°. The six-membered heterocycle of the tetrahydroisoquinoline unit adopts a half-chair conformation. In the crystal, adjacent molecules are linked by pairs of weak intermolecular C—H⋯N hydrogen bonds, forming inversion dimers. An intramolecular C—H⋯I close contact is also observed.